inverse hessian
Batch Acquisition Function Evaluations and Decouple Optimizer Updates for Faster Bayesian Optimization
Irie, Kaichi, Watanabe, Shuhei, Onishi, Masaki
Bayesian optimization (BO) efficiently finds high-performing parameters by maximizing an acquisition function, which models the promise of parameters. A major computational bottleneck arises in acquisition function optimization, where multi-start optimization (MSO) with quasi-Newton (QN) methods is required due to the non-convexity of the acquisition function. BoTorch, a widely used BO library, currently optimizes the summed acquisition function over multiple points, leading to the speedup of MSO owing to Py-Torch batching. Nevertheless, this paper empirically demonstrates the suboptimality of this approach in terms of off-diagonal approximation errors in the inverse Hessian of a QN method, slowing down its convergence. To address this problem, we propose to decouple QN updates using a coroutine while batching the acquisition function calls. Our approach not only yields the theoretically identical convergence to the sequential MSO but also drastically reduces the wall-clock time compared to the previous approaches. Our approach is available in GPSampler in Optuna, effectively reducing its computational overhead.
Reviews: Distributed estimation of the inverse Hessian by determinantal averaging
Additional Comments: • Overall, the article is well-written and structured. It has a clear contribution and also significant theoretical justification. There are only a few mistyping and grammatical errors: Line 17: "tranformation" "transformation" Line 133: " its entries is of …" "its entries are of …" Line 146: " accross" "across" Line 150: " emprical" "empirical" • In line 54, it is better to define in the theorem 2 along with . If there is no constraint on the number of estimators, this means choosing subsamples is with replacement. However, in line 158, the authors claim that their method is without replacement.
Stochastic Quasi-Newton Optimization in Large Dimensions Including Deep Network Training
Suman, Uttam, Mamajiwala, Mariya, Saxena, Mukul, Tyagi, Ankit, Roy, Debasish
Our proposal is on a new stochastic optimizer for non-convex and possibly non-smooth objective functions typically defined over large dimensional design spaces. Towards this, we have tried to bridge noise-assisted global search and faster local convergence, the latter being the characteristic feature of a Newton-like search. Our specific scheme -- acronymed FINDER (Filtering Informed Newton-like and Derivative-free Evolutionary Recursion), exploits the nonlinear stochastic filtering equations to arrive at a derivative-free update that has resemblance with the Newton search employing the inverse Hessian of the objective function. Following certain simplifications of the update to enable a linear scaling with dimension and a few other enhancements, we apply FINDER to a range of problems, starting with some IEEE benchmark objective functions to a couple of archetypal data-driven problems in deep networks to certain cases of physics-informed deep networks. The performance of the new method vis-\'a-vis the well-known Adam and a few others bears evidence to its promise and potentialities for large dimensional optimization problems of practical interest.
Distributed estimation of the inverse Hessian by determinantal averaging
In distributed optimization and distributed numerical linear algebra, we often encounter an inversion bias: if we want to compute a quantity that depends on the inverse of a sum of distributed matrices, then the sum of the inverses does not equal the inverse of the sum. An example of this occurs in distributed Newton's method, where we wish to compute (or implicitly work with) the inverse Hessian multiplied by the gradient. In this case, locally computed estimates are biased, and so taking a uniform average will not recover the correct solution. To address this, we propose determinantal averaging, a new approach for correcting the inversion bias. This approach involves reweighting the local estimates of the Newton's step proportionally to the determinant of the local Hessian estimate, and then averaging them together to obtain an improved global estimate.